-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add client interface. #16
Conversation
WalkthroughThe pull request introduces a quick start guide for the Spider distributed task executor, detailing its core functionalities such as task creation, execution, and cluster setup. It adds several new classes for task management, data handling, and error handling, alongside modifications to the CMake configuration and formatting files. The task registration macro is renamed for consistency, and various include directives are reorganized for improved structure. These changes collectively enhance the usability and modularity of the Spider framework. Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit unclear on how all the APIs fit together concretely and I think someone who is seeing these headers for the first time would be a bit lost. I guess we will need to iterate on these a few times to get them to a point where they are clear (and meet the style guide). I think the first step is if you can add one or more readmes describing the layout of these APIs and how a user can use these APIs. Docstrings for each class will also help significantly.
src/spider/client/Data.hpp
Outdated
private: | ||
std::unique_ptr<DataImpl> m_impl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Private declarations should come after public.
src/spider/client/Data.hpp
Outdated
* Gets the values stored in Data. | ||
* @return value stored in Data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Gets the values stored in Data. | |
* @return value stored in Data. | |
* @return The stored value. |
As our internal guidelines mention, for most getters, we can omit the docstring description and only describe the return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the special case. This is not an ordinary getter. It accesses data storage under the hood and could throw exception, or return error once we get error included in the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since neither the exception or error are currently in the code, let's write the docstring according to what exists (not what will exist later). When we eventually add the error information, we can update the docstring appropriately.
src/spider/client/Data.hpp
Outdated
auto get() -> T; | ||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto get() -> T; | |
/** | |
auto get() -> T; | |
/** |
Add an empty line between methods.
src/spider/client/Data.hpp
Outdated
/** | ||
* Indicates that the data is persisted and should not be rollbacked | ||
* on failure recovery. | ||
*/ | ||
// Not implemented in milestone 1 | ||
// void mark_persist(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my experience, it's a bad idea to add code that we'll use in the future since that day may never come. If that day comes, we will remember what to do based on the design doc rather than this commented code.
src/spider/client/Data.hpp
Outdated
* Sets the key for the data. If no key is provided, Spider generates a key. | ||
* @param key of the data | ||
*/ | ||
auto key(std::string const& key) -> Data<T>::Builder&; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return value is missing from docstring.
src/spider/client/Data.hpp
Outdated
/** | ||
* Sets locality list of the data. | ||
* @param nodes nodes that has locality | ||
* @param hard true if the locality list is a hard requirement, false otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docstrings are a bit unclear to me. I guess the idea is that if hard=false
, the data can be accessed from any node in the cluster, but if possible, it should be accessed from the nodes specified in this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Hard requirement means data can only be accessed on nodes in the locality list. Will change the docstring to make it clear.
src/spider/client/Data.hpp
Outdated
* Sets the key for the data. If no key is provided, Spider generates a key. | ||
* @param key of the data | ||
*/ | ||
auto key(std::string const& key) -> Data<T>::Builder&; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer to name all these setters as set_xxx
where xxx is the thing being set.
src/spider/client/Data.hpp
Outdated
// Not implemented for milestone 1 | ||
// auto rollback(std::function<const T&()> const& f) -> Data<T>::Builder&; | ||
/** | ||
* Builds the data. Stores the value of data into storage with locality list, persisted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this class store the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internally all values are serialized and stored in data storage.
src/spider/client/Data.cpp
Outdated
|
||
class DataImpl {}; | ||
|
||
template <class T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should document template parameters as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed quickstart.
docs/QuickStart.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our naming convention is to use kebab-case for markdown files except for the main readme in a repo which is called README.md
. So this file should be called quick-start.md
. I've added this to our internal guidelines.
docs/QuickStart.md
Outdated
@@ -0,0 +1,204 @@ | |||
# Spider Quick Start Guide |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our convention is to use sentence case for headings rather than capitlizing every word. So this should be written as # Spider quick start guide
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed all titles, but I treat Spider as a special terminology and keep it capitalized.
docs/QuickStart.md
Outdated
# Spider Quick Start Guide | ||
|
||
## Set Up Spider | ||
To get started, first start a database supported by Spider, e.g. MySql. Second, start a scheduler and connect it to the database by running `spider start --scheduler --db <db_url> --port <scheduler_port>`. Third, start some workers and connect them to the database by running `spider start --worker --db <db_url>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to use lists rather than a block of text. Lists are faster to read than blocks of text.
docs/QuickStart.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a linter to do this later, but treat Markdown as code and wrap lines to 100 characters. This makes it easier to review and read (not everyone has a good Markdown renderer on their local machine).
docs/QuickStart.md
Outdated
# Spider Quick Start Guide | ||
|
||
## Set Up Spider | ||
To get started, first start a database supported by Spider, e.g. MySql. Second, start a scheduler and connect it to the database by running `spider start --scheduler --db <db_url> --port <scheduler_port>`. Third, start some workers and connect them to the database by running `spider start --worker --db <db_url>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a caveat for starting the worker below. You should note that here otherwise someone will try to start the worker, it won't work, and they will give up.
docs/QuickStart.md
Outdated
} | ||
``` | ||
|
||
However, it is impossible to get the return value of the task graph from a client. We have a solution by sharing data using key-value store, which will be discussed later. Another solution is to run task or task graph inside a task and wait for its value, just like a client. This solution is closer to the conventional function call semantic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, it is impossible to get the return value of the task graph from a client.
This sentence doesn't make sense since the previous section has an example where the client is retrieving the result of a task graph.
Reading the later parts of this guide, I guess you mean that you can't return the value of a dynamically created task to the client (which is intuitive since the interface doesn't provide a mechanism to do so).
docs/QuickStart.md
Outdated
|
||
Spider lets user pass the metadata of these data around in `spider::Data` objects. `Data` stores the value of the metadata information of external data, and provides crucial information to Spider for correct and efficient scheduling and failure recovery. `Data` stores a list of nodes which has locality of the external data, and user can specify if locality is a hard requirement, i.e. task can only run on the nodes in locality list. `Data` can include a `cleanup`function, which will run when the `Data` object is no longer reference by any task and client. `Data` has a persist flag to represent that external data is persisted and do not need to be cleaned up. | ||
|
||
```c++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is complicated and should definitely have a description. To not have a description means that we're slowing down the user since they first need to infer what the goal of the example's code is, then they can focus on the details of using the framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is also a good reason why our internal guidelines say that we should organize things from public to private. If you apply that thinking here, we would put main
first and then the task methods. That is the order that a user should read the methods (the opposite order requires the user to remember a lot of context before they get to main
).
docs/QuickStart.md
Outdated
.build(HdfsFile { "/path/to/input" }); | ||
spider::Future<spider::Data<HdfsFile>> future = spider::run( | ||
spider::bind(map, filter), | ||
input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input
is not a POD, so that violates the constraint mentioned above about task arguments being POD.
docs/QuickStart.md
Outdated
To get started, first start a database supported by Spider, e.g. MySql. Second, start a scheduler and connect it to the database by running `spider start --scheduler --db <db_url> --port <scheduler_port>`. Third, start some workers and connect them to the database by running `spider start --worker --db <db_url>`. | ||
|
||
## Start a Client | ||
Client first creates a Spider client driver and connects it to the database. Spider automatically cleans up the resource in driver's destructor, but you can close the driver to release the resource early. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have an intro section where we describe the actors and architecture in the system (client, scheduler, worker, database, etc.). It doesn't need to be too detailed (that can be in another doc), but it should provide enough context for the user to visualize what we're talking about.
docs/QuickStart.md
Outdated
## Note on Worker Setup | ||
The setup section said that we can start a worker by running `spider start --worker --db <db_url>`. This is oversimplified. The worker has to know the function it will run. | ||
|
||
When user compiles the client code, an executable and a library are generated. The executable executes the client code as expected. The library contains all the functions registered by user. Worker needs to run with a copy of this library. The actual commands to start a worker is `spider start --worker --db <db_url> --libs [client_libraries]`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, so in the guide above, wouldn't it be easier to just say that we need to create a task library with all the tasks the user wants to run? Then we can reference those tasks in the client code. Right now, the guide says we need to register the tasks by calling spider::register_task
which sounds pretty opaque to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User is responsible to create a task library and the client executable. Actually a good practice for user is to put the spider::register_task
with the tasks' declaration instead of main, so maybe I should task registration into "Create a task section?
docs/quick_start.md
Outdated
`spider::Data` that is directly used as input. Spider requires that the types of `Task` or | ||
`TaskGraph` outputs or POD type or `spider::Data` matches the input types of child task. | ||
|
||
Binding the tasks together forms a dependencies among tasks, which is represented by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- How do you plan to communicate (serialize) the task calls (name + args) from the client to the server?
- Let's say a task takes two inputs. Does this interface support using a constant for one input and a task for the other?
- If so, do you support a task graph that looks like this?
flowchart TD leaf["foo(int, int) -> int"] parent["bar(int, int) -> int"] 3 --> leaf 4 --> parent 5 --> parent parent --> leaf
- If so, that means any arguments the user passes into
run
get passed to the inputs in the task graph with a kind of DFS ordering. Is that true?
- If so, do you support a task graph that looks like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Here comes the use of
spider::register_task
, which records the mapping between function name and the function pointer. When the user callsrun
with the function pointer, the library actually sends the function name to the db. Worker also gets the name of the function as part of the task metadata. Since the worker links to the library withregister_task
call, it also has the mapping and can know which function to call.
As for arguments, their types are stored in db (serialized as string right now), and values are serialized into string. - Yes.
i. Yes.
ii. No.spider::bind
needs to bind all inputs of the child task. Thus, the value 3 must be the an argument in bind. 4 and 5 can be passed inrun
. However, it is possible that there are multiple first layer tasks, i.e. tasks with no parents. The input of task graph is the product of all first layer tasks' output.
It is possible to support passing 3 as an argument inrun
for the above example. We can have a special placeholder type forbind
. In such case, DFS is an intuitive order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you explain how:
register_task
maps a function pointer into a function name?run
turns a function pointer into a function name?
- Can you explain this point with an example?
However, it is possible that there are multiple first layer tasks, i.e. tasks with no parents. The input of task graph is the product of all first layer tasks' output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- For client and worker, when the library is loaded, it will create a mapping between function name and function pointer.
i. I am thinking of turningregister_task
into a macro to get the function name at compile time. It then stores the mapping between the function name to the function pointer.
ii.run
gets the function name from the mapping stored before. - In the example below, both
bar
andbaz
has no parent, and task graph input is their inputs put together, i.e. 1, 2, 3 and 4.
graph TD
1[1]
2[2]
3[3]
4[4]
foo["foo(int, int) -> int"]
bar["bar(int, int) -> int"]
baz["baz(int, int) -> int"]
1 --> bar
2 --> bar
3 --> baz
4 --> baz
bar --> foo
baz --> foo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Makes sense.
- Gotcha. So to clarify my understanding:
- in
run
, users can only pass arguments to root tasks (tasks with no parents). - in
run
, when the user passes a list of arguments[1, 2, 3, 4]
, conceptually:- the framework will iterate over the root tasks in order;
- for each task argument, the framework will pop one argument from the list.
- in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and yes.
… in client id spider::run now returns a Job represents a running task or task graph, so user can cancel the job. User can now pass in client id when creating the driver, so the recovered client can get all jobs by client id.
Can you make the changes to the header files that we discussed on Thursday? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some additional comments.
src/spider/client/Spider.hpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this file doesn't correspond to a class, the filename should be lowercase.
src/spider/client/Spider.hpp
Outdated
#include <optional> | ||
#include <string> | ||
|
||
// NOLINTBEGIN(misc-include-cleaner) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using IWYU's pragmas, right?
src/spider/client/Spider.hpp
Outdated
/** | ||
* Initializes Spider library | ||
*/ | ||
void init(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will this function do internally?
src/spider/client/Spider.hpp
Outdated
* Registers function to Spider | ||
* @param function function to register | ||
*/ | ||
template <class R, class... Args> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's replace R
and Args
with C++20 concepts that specify the exact properties of acceptable inputs. If I understand correctly, currently, return values and arguments can be a certain set of serializable values.
On that note though, if the requirement is that the value types need to be serializable, why don't we define an interface for serializable value types and then the user has the flexibility to use any serializable type they want? We could do this later, but conceptually, is it possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also use the same C++20 concepts wherever we have task inputs/outputs. This would also simplify the docstrings since we don't need to repeat what types are acceptable.
src/spider/client/Spider.hpp
Outdated
* @return job representing the running task | ||
*/ | ||
template <class R, class... Args> | ||
auto run(std::function<R(Args...)> const& task, Args&&... args) -> Job<R>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming the run
functions to start
? run
implies that the function won't return until the task/taskgraph completes.
src/spider/client/Spider.hpp
Outdated
|
||
// NOLINTEND(misc-include-cleaner) | ||
|
||
namespace spider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the functions in this file and those in KeyValueData
, I feel like it makes more sense to have a Client/Driver class that encapsulates the functions (similar to Context
). Otherwise, a user could call, for example, insert_kv
without initializing a client and then it would fail or invoke undefined behaviour, right?
src/spider/client/Job.hpp
Outdated
@@ -0,0 +1,61 @@ | |||
#ifndef SPIDER_CLIENT_FUTURE_HPP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double-check all the header guards.
src/spider/client/Data.hpp
Outdated
* | ||
* @tparam T type of the value. T must be a POD. | ||
*/ | ||
template <class T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I mentioned in Spider.hpp, we can restrict the types using C++20 concepts.
CMakeLists.txt
Outdated
# AppleClang complains about file has no symbol and abort the build. | ||
if(APPLE) | ||
set(CMAKE_CXX_ARCHIVE_CREATE "<CMAKE_AR> Scr <TARGET> <LINK_FLAGS> <OBJECTS>") | ||
set(CMAKE_CXX_ARCHIVE_FINISH "<CMAKE_RANLIB> -no_warning_for_no_symbols -c <TARGET>") | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this anymore considering we've dropped support for building on macOS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
src/spider/client/Data.hpp (1)
46-74
: Add validation and improve Builder documentation.The Builder implementation could benefit from:
- Documentation about method chaining
- Validation for empty node vectors and null cleanup functions
Consider adding validation in
set_locality
:auto set_locality(std::vector<std::string> const& nodes, bool hard) -> Builder& { if (nodes.empty()) { throw std::invalid_argument("Locality nodes vector cannot be empty"); } // ... existing implementation }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
src/spider/client/Data.hpp
(1 hunks)src/spider/client/TaskContext.hpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/spider/client/TaskContext.hpp
🔇 Additional comments (5)
src/spider/client/Data.hpp (5)
1-10
: LGTM! Header organization follows best practices.
The includes are well-organized and appropriate for the required functionality.
76-78
: LGTM! Good use of pImpl idiom.
The use of pImpl pattern effectively hides implementation details and helps maintain ABI compatibility.
31-45
: 🛠️ Refactor suggestion
Improve method documentation and ordering.
- The
get()
method should document potential errors or exceptions. - The
set_locality()
parameter documentation should explain whatnodes
represents. - Per Google C++ style guide, private declarations should come after public ones.
Consider adding error handling documentation:
/**
* Retrieves the stored value.
* @return The stored value.
* @throws std::runtime_error If the data cannot be accessed or deserialized.
*/
auto get() -> T;
29-29
: Verify Serializable concept implementation.
Let's ensure the Serializable concept is properly defined and documented.
✅ Verification successful
Serializable concept is properly implemented and used
The Serializable concept is well-defined in src/spider/core/Serializer.hpp
as a composition of two concepts:
SerializableImpl
: Requires types to support msgpack serializationDeSerializableImpl
: Requires types to support msgpack deserialization
The concept is appropriately used in the codebase:
- As a template constraint for the
Data
class - As part of the
TaskIo
concept definition
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for Serializable concept definition
rg -A 5 "concept\s+Serializable"
# Search for any serialization-related functions that types must implement
ast-grep --pattern 'template <$_>
concept Serializable = $_'
Length of output: 592
Script:
#!/bin/bash
# Let's also check for DeSerializableImpl concept definition
rg -A 5 "concept\s+DeSerializableImpl"
# And check where this Serializable concept is being used
rg "Serializable\s*[<\s]" -A 2
Length of output: 930
14-29
:
Fix inconsistencies in example code.
The example code has several issues:
set_locality
example uses a single string, but the method expects a vector- There's a typo in
cont&
instead ofconst&
- The cleanup function signature doesn't match the method declaration
Apply this diff to fix the example:
- .set_locality({ "node_address" }, true)
- .set_cleanup_func([](std::string cont& path) { std::remove(path); })
+ .set_locality(std::vector<std::string>{"node_address"}, true)
+ .set_cleanup_func([](std::string const& path) { std::remove(path.c_str()); })
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
src/spider/client/Data.hpp (3)
29-30
: Consider adding error handling mechanism.The
Serializable
concept constraint is good, but the class should provide a way to handle potential serialization errors.Consider adding error handling through either:
- Exception handling
- std::expected (C++23) or tl::expected
- Error codes
37-44
: Enhance documentation for locality parameters.The parameter documentation could be more descriptive:
- What constitutes a valid node address?
- What happens if an invalid node is specified?
- What are the implications of hard vs soft locality?
46-74
: Consider adding noexcept specifications and move operations.The Builder implementation could be enhanced:
- Mark methods as noexcept where applicable
- Add move operations for better performance
- Consider making cleanup function noexcept
Example improvement:
auto set_cleanup_func(std::function<void(T const&) noexcept> f) noexcept -> Builder&;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
src/spider/client/Data.hpp
(1 hunks)src/spider/client/Driver.hpp
(1 hunks)src/spider/client/Job.hpp
(1 hunks)src/spider/client/TaskContext.hpp
(1 hunks)src/spider/client/task.hpp
(1 hunks)src/spider/client/type_utils.hpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/spider/client/Driver.hpp
- src/spider/client/Job.hpp
- src/spider/client/TaskContext.hpp
- src/spider/client/task.hpp
- src/spider/client/type_utils.hpp
🔇 Additional comments (2)
src/spider/client/Data.hpp (2)
1-10
: LGTM! Well-organized includes and proper header guards.
The includes are properly organized with standard libraries first, followed by local includes, and all are necessary for the implementation.
20-25
:
Fix inconsistencies in the example code.
The example code doesn't match the actual method signatures:
- The cleanup function lambda has incorrect signature
- The locality parameter is passed as a single string instead of a vector
Apply this diff to fix the example:
auto disk_file_data = spider::Data<std::string>::Builder()
- .set_locality({"node_address"}, true)
- .set_cleanup_func([](std::string cont& path) { std::remove(path); })
+ .set_locality(std::vector<std::string>{"node_address"}, true)
+ .set_cleanup_func([](std::string const& path) { std::remove(path.c_str()); })
.build("/path/of/file");
Likely invalid or redundant comment.
private: | ||
std::unique_ptr<DataImpl> m_impl; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add destructor declaration for pImpl idiom.
When using the pImpl idiom with std::unique_ptr, you need to declare the destructor in the header and define it in the implementation file where the complete type is available.
Add the following declaration:
~Data(); // Declaration only, define in .cpp file
src/spider/client/Data.hpp
Outdated
#ifndef SPIDER_CLIENT_DATA_HPP | ||
#define SPIDER_CLIENT_DATA_HPP | ||
|
||
#include <functional> | ||
#include <memory> | ||
#include <string> | ||
#include <vector> | ||
|
||
#include "../core/Serializer.hpp" | ||
|
||
namespace spider { | ||
class DataImpl; | ||
|
||
/** | ||
* A representation of data stored on external storage. This class allows the user to define: | ||
* - how the data should be cleaned up (garbage collected) once it is no longer referenced. | ||
* - the locality of the data. | ||
* | ||
* Example: | ||
* @code{.cpp} | ||
* auto disk_file_data = spider::Data<std::string>::Builder() | ||
* .set_locality({"node_address"}, true) | ||
* .set_cleanup_func([](std::string cont& path) { std::remove(path); }) | ||
* .build("/path/of/file"); | ||
* @endcode | ||
* | ||
* @tparam T Type of the value. | ||
*/ | ||
template <Serializable T> | ||
class Data { | ||
public: | ||
/** | ||
* @return The stored value. | ||
*/ | ||
auto get() -> T; | ||
|
||
/** | ||
* Sets the data's locality, indicated by the nodes that contain the data. | ||
* | ||
* @param nodes | ||
* @param hard Whether the data is only accessible from the given nodes (i.e., the locality is a | ||
* hard requirement). | ||
*/ | ||
void set_locality(std::vector<std::string> const& nodes, bool hard); | ||
|
||
class Builder { | ||
public: | ||
/** | ||
* Sets the data's locality, indicated by the nodes that contain the data. | ||
* | ||
* @param nodes | ||
* @param hard Whether the data is only accessible from the given nodes (i.e., the locality | ||
* is a hard requirement. | ||
* @return self | ||
*/ | ||
auto set_locality(std::vector<std::string> const& nodes, bool hard) -> Builder&; | ||
|
||
/** | ||
* Sets the cleanup function for the data. This function will be called when the data is no | ||
* longer referenced. | ||
* | ||
* @param f | ||
* @return self | ||
*/ | ||
auto set_cleanup_func(std::function<void(T const&)> const& f) -> Builder&; | ||
|
||
/** | ||
* Builds the data object. | ||
* | ||
* @param t Value of the data | ||
* @return The built object. | ||
*/ | ||
auto build(T const& t) -> Data; | ||
}; | ||
|
||
private: | ||
std::unique_ptr<DataImpl> m_impl; | ||
}; | ||
} // namespace spider | ||
|
||
#endif // SPIDER_CLIENT_DATA_HPP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Template implementation file missing
The code appears to be a header-only template library where:
Data.hpp
in the client directory defines the template interfaceData.hpp
in the core directory contains a different concrete implementation- The template implementation is missing, and
DataImpl
is only forward-declared
Required actions:
- Implement the template methods for
Data<T>
either in the header file or in a separate.ipp
file - Provide the complete implementation of the
DataImpl
class
🔗 Analysis chain
Verify implementation file exists.
Ensure that all declared methods are implemented in the corresponding .cpp file.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for the implementation file
fd "Data\.(cpp|cc|cxx)$" src/spider/client/
# Check if all methods are implemented
rg "Data<.*>::\w+" src/spider/client/
Length of output: 174
Script:
#!/bin/bash
# Let's check if there's a template implementation in the header file itself
rg -U "template.*class Data.*\{.*\}" -A 100 src/spider/client/Data.hpp
# Also check if there's any inline implementation
rg "inline.*Data<.*>::\w+" src/spider/client/
# Check for potential implementation in other directories
fd "Data(Impl)?\.(cpp|cc|cxx|hpp|h)$" src/spider/
Length of output: 224
Script:
#!/bin/bash
# Let's examine the core/Data.hpp to understand if it contains the implementation
cat src/spider/core/Data.hpp
# Also check for DataImpl definition
rg "class DataImpl" src/spider/
Length of output: 2017
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- All of the
Driver
andTaskContext
methods should throw an exception if they can't connect to the storage component, right? If so, can we document that with@throw
? - Do we need any of the forward declared
XxxImpl
classes in this PR? - I'm still working on the quick-start guide, so in the interest of time, can we remove it from this pull request and then I'll submit another pull request to add it?
* fails and then needs to reconnect and retrieve all previously created jobs. NOTE: It is | ||
* undefined behaviour for two clients to concurrently use the same ID. | ||
*/ | ||
Driver(std::string const& storage_url, boost::uuids::uuid id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method should throw an exception if the client ID is already in use, right?
src/spider/client/TaskContext.hpp
Outdated
#include "TaskGraph.hpp" | ||
|
||
namespace spider { | ||
class ContextImpl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this in this PR?
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
src/spider/client/Data.hpp (2)
37-44
: Enhance locality documentationThe documentation should clarify:
- The implications of hard vs soft locality
- The behaviour when accessing data from non-specified nodes
- Whether locality can be changed after data is created
46-74
: Enhance Builder implementationConsider the following improvements:
- Add move semantics support
- Add parameter validation
- Make Builder moveable but non-copyable
class Builder { public: Builder() = default; Builder(Builder&&) noexcept = default; Builder& operator=(Builder&&) noexcept = default; Builder(const Builder&) = delete; Builder& operator=(const Builder&) = delete; // ... existing methods ... private: void validate_nodes(std::vector<std::string> const& nodes) { if (nodes.empty()) { throw std::invalid_argument("Nodes vector cannot be empty"); } } };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/spider/client/Data.hpp
(1 hunks)
🔇 Additional comments (3)
src/spider/client/Data.hpp (3)
20-25
: Update example code to match method signatures.
The example code doesn't match the actual method signatures:
set_locality
expects a vector of strings, not a single stringset_cleanup_func
signature isvoid(T const&)
, which matches the example but differs from what was previously discussed
- auto disk_file_data = spider::Data<std::string>::Builder()
- .set_locality({"node_address"}, true)
- .set_cleanup_func([](std::string const& path) { std::filesystem::remove(path); })
- .build("/path/of/file");
+ auto disk_file_data = spider::Data<std::string>::Builder()
+ .set_locality(std::vector<std::string>{"node_address"}, true)
+ .set_cleanup_func([](std::string const& path) { std::filesystem::remove(path); })
+ .build("/path/of/file");
76-78
: Add destructor declaration for pImpl idiom
When using the pImpl idiom with std::unique_ptr, you need to declare the destructor in the header.
private:
std::unique_ptr<DataImpl> m_impl;
+ ~Data();
29-30
: Template implementation file missing
The template implementation is missing. Consider:
- Adding implementation in a separate .ipp file
- Including the .ipp file at the end of this header
/** | ||
* @return The stored value. | ||
*/ | ||
auto get() -> T; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider adding error handling to get()
The get()
method might fail when accessing the underlying storage. Consider:
- Adding error handling using std::expected (C++23) or Result type
- Documenting potential failure scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
src/spider/client/Exception.hpp (1)
11-30
: Consider implementing a base exception class.To improve maintainability and reduce code duplication, consider implementing a base Spider exception class that handles common functionality like message formatting and storage. This would also make it easier to add new exception types in the future.
Example structure:
class SpiderException : public std::exception { protected: explicit SpiderException(std::string message) : m_message(std::move(message)) {} mutable std::string m_message; public: [[nodiscard]] const char* what() const noexcept override { return m_message.c_str(); } }; class ConnectionException final : public SpiderException { public: explicit ConnectionException(std::string addr) : SpiderException(fmt::format("Cannot connect to storage {}.", addr)) {} };src/spider/client/Data.hpp (1)
37-44
: Improve parameter names and documentation for set_localityThe parameter names and documentation could be more descriptive:
- Rename
nodes
tonode_addresses
for clarity- Expand documentation to explain the implications of hard vs soft locality requirements
- * @param nodes + * @param node_addresses List of node addresses where the data is stored - * @param hard Whether the data is only accessible from the given nodes (i.e., the locality is a - * hard requirement). + * @param hard If true, data can only be accessed from the specified nodes. + * If false, data can be accessed from any node but preferably from the specified nodes.src/spider/client/Job.hpp (1)
12-14
: Address the TODO: Implement error handling withstd::expected
orBoost.Outcome
The TODO comment suggests enhancing error handling by using
std::expected
orBoost.Outcome
. Implementing this would allow users to retrieve job results in a more robust and less error-prone manner, simplifying the API.Would you like assistance in implementing this enhancement or opening a GitHub issue to track this task?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
src/spider/client/Data.hpp
(1 hunks)src/spider/client/Driver.hpp
(1 hunks)src/spider/client/Exception.hpp
(1 hunks)src/spider/client/Job.hpp
(1 hunks)src/spider/client/TaskContext.hpp
(1 hunks)src/spider/client/TaskGraph.hpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/spider/client/Driver.hpp
- src/spider/client/TaskContext.hpp
- src/spider/client/TaskGraph.hpp
🔇 Additional comments (7)
src/spider/client/Exception.hpp (1)
1-10
: LGTM! Header guards and includes are properly implemented.
The header guards follow the standard naming convention, and all necessary headers are included for the required functionality.
src/spider/client/Data.hpp (4)
20-25
: Update example code to match method signatures.
The example code doesn't match the actual method signatures:
set_cleanup_func
expectsvoid(T const&)
but example shows filesystem operationset_locality
is shown with a single string but expects vector
32-35
: Consider adding error handling to get()
The method should handle potential failures when accessing the underlying storage. Consider using std::expected (C++23) or a Result type.
78-80
: Add destructor declaration for pImpl idiom
When using the pImpl idiom with std::unique_ptr, you need to declare the destructor in the header file.
29-80
: Template implementation file missing
The template implementation is missing:
- Methods are declared but not implemented
- Implementation must be available at compile time
src/spider/client/Job.hpp (2)
60-60
: Consider adding runtime checks to prevent undefined behaviour
As previously noted, the methods get_result()
and get_error()
may cause undefined behaviour if called when the job is not in the correct state. Adding runtime checks to throw exceptions in these methods would improve robustness.
Also applies to: 71-71
36-36
: Provide implementations for template methods to prevent linker errors
As previously mentioned, the methods declared in the Job
template class lack implementations. This could lead to linker errors. Consider implementing these methods inline or in an included template implementation file.
Also applies to: 43-43, 50-50, 60-60, 71-71
src/spider/client/Exception.hpp
Outdated
class ConnectionException final : std::exception { | ||
public: | ||
auto what() -> std::string { return fmt::format("Cannot connect to storage {}.", m_addr); } | ||
|
||
private: | ||
std::string m_addr; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical issues in ConnectionException implementation.
The class has several issues that need to be addressed:
- Inheritance from std::exception should be public
- what() method doesn't properly override std::exception::what()
- Missing constructor to initialize m_addr
- Missing address parameter in error message
Apply this diff to fix the issues:
-class ConnectionException final : std::exception {
+class ConnectionException final : public std::exception {
public:
- auto what() -> std::string { return fmt::format("Cannot connect to storage {}.", m_addr); }
+ explicit ConnectionException(std::string addr) : m_addr(std::move(addr)) {}
+
+ [[nodiscard]] const char* what() const noexcept override {
+ m_what = fmt::format("Cannot connect to storage {}.", m_addr);
+ return m_what.c_str();
+ }
private:
std::string m_addr;
+ mutable std::string m_what; // Mutable to allow modification in const what()
};
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
class ConnectionException final : std::exception { | |
public: | |
auto what() -> std::string { return fmt::format("Cannot connect to storage {}.", m_addr); } | |
private: | |
std::string m_addr; | |
}; | |
class ConnectionException final : public std::exception { | |
public: | |
explicit ConnectionException(std::string addr) : m_addr(std::move(addr)) {} | |
[[nodiscard]] const char* what() const noexcept override { | |
m_what = fmt::format("Cannot connect to storage {}.", m_addr); | |
return m_what.c_str(); | |
} | |
private: | |
std::string m_addr; | |
mutable std::string m_what; // Mutable to allow modification in const what() | |
}; |
src/spider/client/Exception.hpp
Outdated
class DriverIdUsedException final : std::exception { | ||
public: | ||
auto what() -> std::string { | ||
return fmt::format("Driver id {} already used.", boost::uuids::to_string(m_id)); | ||
} | ||
|
||
private: | ||
boost::uuids::uuid m_id; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical issues in DriverIdUsedException implementation.
The class has similar issues to ConnectionException:
- Inheritance from std::exception should be public
- what() method doesn't properly override std::exception::what()
- Missing constructor to initialize m_id
Apply this diff to fix the issues:
-class DriverIdUsedException final : std::exception {
+class DriverIdUsedException final : public std::exception {
public:
- auto what() -> std::string {
- return fmt::format("Driver id {} already used.", boost::uuids::to_string(m_id));
- }
+ explicit DriverIdUsedException(boost::uuids::uuid id) : m_id(id) {}
+
+ [[nodiscard]] const char* what() const noexcept override {
+ m_what = fmt::format("Driver id {} already used.", boost::uuids::to_string(m_id));
+ return m_what.c_str();
+ }
private:
boost::uuids::uuid m_id;
+ mutable std::string m_what; // Mutable to allow modification in const what()
};
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
class DriverIdUsedException final : std::exception { | |
public: | |
auto what() -> std::string { | |
return fmt::format("Driver id {} already used.", boost::uuids::to_string(m_id)); | |
} | |
private: | |
boost::uuids::uuid m_id; | |
}; | |
class DriverIdUsedException final : public std::exception { | |
public: | |
explicit DriverIdUsedException(boost::uuids::uuid id) : m_id(id) {} | |
[[nodiscard]] const char* what() const noexcept override { | |
m_what = fmt::format("Driver id {} already used.", boost::uuids::to_string(m_id)); | |
return m_what.c_str(); | |
} | |
private: | |
boost::uuids::uuid m_id; | |
mutable std::string m_what; // Mutable to allow modification in const what() | |
}; |
|
||
/** | ||
* NOTE: It is undefined behavior to call this method for a job that is not in the `Succeed` | ||
* state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct the job state names in the documentation
There are inconsistencies in the documentation comments regarding job state names:
- In line 54, '
Succeed
' should be 'Succeeded
' to match the enumJobStatus::Succeeded
. - In line 63, '
Fail
' should be 'Failed
' to match the enumJobStatus::Failed
.
Apply the following changes to correct the documentation:
- * NOTE: It is undefined behaviour to call this method for a job that is not in the `Succeed` state.
+ * NOTE: It is undefined behaviour to call this method for a job that is not in the `Succeeded` state.
- * NOTE: It is undefined behaviour to call this method for a job that is not in the `Fail` state.
+ * NOTE: It is undefined behaviour to call this method for a job that is not in the `Failed` state.
Also applies to: 63-63
docs/quick_start.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still need to remove this, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
src/spider/client/Exception.hpp (2)
14-15
: Consider using std::string_view for better performance.The constructor could be more efficient by accepting a std::string_view instead of const std::string&, especially since the parameter is only used for formatting.
- explicit ConnectionException(std::string const& addr) + explicit ConnectionException(std::string_view addr)
26-28
: Improve message formatting consistency.The message formatting spans multiple lines with inconsistent indentation. Consider reformatting for better readability:
- : m_message( - fmt::format("Driver ID {} is currently in use.", boost::uuids::to_string(id)) - ) {} + : m_message(fmt::format("Driver ID {} is currently in use.", + boost::uuids::to_string(id))) {}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
src/spider/client/Data.hpp
(1 hunks)src/spider/client/Driver.hpp
(1 hunks)src/spider/client/Exception.hpp
(1 hunks)src/spider/client/Job.hpp
(1 hunks)src/spider/client/TaskContext.hpp
(1 hunks)src/spider/client/TaskGraph.hpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/spider/client/Data.hpp
- src/spider/client/Driver.hpp
- src/spider/client/Job.hpp
- src/spider/client/TaskContext.hpp
- src/spider/client/TaskGraph.hpp
🔇 Additional comments (2)
src/spider/client/Exception.hpp (2)
1-10
: LGTM! Header guards and includes are well-organized.
The includes are properly organized with standard library headers first, followed by third-party libraries. All necessary dependencies are included.
11-35
: Well-designed exception hierarchy!
The implementation follows good C++ practices with:
- Proper inheritance from std::exception
- Consistent design patterns across exception classes
- Thread-safe message handling
- Clear and informative error messages
Description
Add client interface.
Validation performed
Summary by CodeRabbit
Release Notes
New Features
Data
class for managing external data storage and aDriver
class for client interactions with the Spider framework.Job
class for tracking running tasks and aTaskContext
class for managing task execution.Bug Fixes
Documentation
Refactor